
Creators/Authors contains: "Farrukh, Ahmed"

  1. Quantization is often cited as a technique for reducing model size and accelerating deep learning inference. However, past literature suggests that the effect of quantization on latency varies significantly across settings, in some cases even increasing inference time rather than reducing it. To address this discrepancy, we conduct a series of systematic experiments on the Chameleon testbed to investigate the impact of three key variables on the effect of post-training quantization: the machine learning framework, the compute hardware, and the model itself. Our experiments demonstrate that each of these variables has a substantial impact on the overall inference time of a quantized model (a minimal sketch of such a latency measurement follows this listing). Furthermore, we make experiment materials and artifacts publicly available so that others can validate our findings on the same hardware using Chameleon, and we share open educational resources on this topic that may be adopted in formal and informal education settings.
    Free, publicly accessible full text available May 19, 2026
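
The sketch below is not the paper's released artifact code; it is a minimal illustration, assuming PyTorch's built-in dynamic post-training quantization and a hypothetical stand-in model, of how one might compare fp32 and int8 inference latency in the spirit of the experiments the abstract describes:

```python
# Minimal sketch: measure how post-training dynamic quantization
# affects CPU inference latency in PyTorch. Illustrative only; the
# paper's experiments vary framework, hardware, and model on the
# Chameleon testbed, and this stand-in model is hypothetical.
import time
import torch
import torch.nn as nn

# Small stand-in model (not from the paper).
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Post-training dynamic quantization: weights of nn.Linear layers
# are converted to int8; activations are quantized at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def mean_latency_ms(m, x, warmup=10, iters=100):
    """Average single-batch inference time in milliseconds."""
    with torch.inference_mode():
        for _ in range(warmup):
            m(x)
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
        return (time.perf_counter() - start) / iters * 1e3

x = torch.randn(1, 512)
print(f"fp32: {mean_latency_ms(model, x):.3f} ms")
print(f"int8: {mean_latency_ms(quantized, x):.3f} ms")
```

Note that on some framework/hardware combinations the int8 measurement can come out larger than the fp32 one, which is precisely the discrepancy the abstract says the paper investigates.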